[8.x](backport #41817) [aws] [s3] Introduce ignore_older & start_timestamp for S3 input allowing better registry cleanups #42246

Merged
merged 1 commit into from
Jan 7, 2025


mergify[bot]

@mergify mergify bot commented Jan 7, 2025

Proposed commit message

Introduce ignore_older and start_timestamp properties to the AWS S3 input. This is a follow-up for #41694.

The configurations introduced here act as input object filters. If an object does not match the derived filters, its entry is cleaned up from the registry, reducing Filebeat memory consumption.

The introduced configurations are:

  • ignore_older: A time duration; only objects last modified within this duration, counted back from the current time, are accepted for processing
  • start_timestamp: A timestamp from which objects are accepted for processing

For both settings, the object's last modified timestamp is used for the comparison. See the Use cases section for further explanation.
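The two comparisons described above can be sketched as follows. This is a minimal illustration, not the code from this PR: the function names (notBeforeStart, withinLookBack) and the helpers are hypothetical, and treating an object modified exactly at start_timestamp as accepted is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// notBeforeStart reports whether lastModified is at or after start.
// A zero start means start_timestamp is not configured (assumption).
func notBeforeStart(lastModified, start time.Time) bool {
	return start.IsZero() || !lastModified.Before(start)
}

// withinLookBack reports whether lastModified is at or after now-ignoreOlder.
// A non-positive duration means ignore_older is not configured (assumption).
func withinLookBack(lastModified, now time.Time, ignoreOlder time.Duration) bool {
	return ignoreOlder <= 0 || !lastModified.Before(now.Add(-ignoreOlder))
}

// mustTime parses an RFC 3339 timestamp and panics on malformed input.
func mustTime(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

// mustDuration parses a Go duration string such as "48h".
func mustDuration(s string) time.Duration {
	d, err := time.ParseDuration(s)
	if err != nil {
		panic(err)
	}
	return d
}

func main() {
	now := mustTime("2025-01-07T12:00:00Z")
	start := mustTime("2025-01-01T00:00:00Z")
	lastModified := mustTime("2025-01-06T00:00:00Z")

	fmt.Println("passes start_timestamp:", notBeforeStart(lastModified, start))
	fmt.Println("passes ignore_older:", withinLookBack(lastModified, now, mustDuration("48h")))
}
```

Both checks take only the object's last modified timestamp as input, which matches the description above; the object key and content play no role in the filtering.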

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Disruptive User Impact

None, as both settings are disabled by default. However, when the configurations introduced here are used, the following can affect the user:

  • When start_timestamp is defined, objects whose last modified timestamp is earlier than the configured timestamp are not processed (documented 1)
  • When ignore_older is defined, objects that fall outside the look-back period at the start of a polling run are ignored (documented 1)
  • When both start_timestamp & ignore_older are defined, the initial run processes all objects going back to start_timestamp. Subsequent runs exclude objects that fall outside ignore_older, even if processing failed for an object. (documented 1)

How to test this PR locally

  • Build Filebeat from the changeset included in the PR
  • Prepare a source S3 bucket with objects (you may use this tool 2 to create entries)
  • Try configuring Filebeat with different values for ignore_older & start_timestamp to see how data ingestion changes with their values. See the Use cases section for further explanation.

Related issues

Use cases

Consider the diagrams below, where there are three objects, Object A, Object B and Object C, with last modified timestamps t1, t2 and t3.

Consider how Filebeat processes objects and tracks registry entries in the following scenarios.

Default behavior

If neither configuration is used, Filebeat processes all objects, and the internal registry keeps tracking them until they are removed from the bucket.

image

Use start_timestamp

If start_timestamp is used, objects newer than the timestamp are accepted for processing. The registry keeps growing unless objects are removed from the bucket by other means (e.g., a lifecycle policy).

image

Use ignore_older

If ignore_older is defined, the input processes objects whose last modified timestamp falls within the provided duration, calculated back from the current time. The registry tracks objects within the current time window; the rest are eventually cleaned up by subsequent runs.
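The cleanup effect of ignore_older can be pictured with a small sketch. The map below is a hypothetical stand-in for the registry (object key to last modified timestamp), and pruneRegistry, sampleRegistry and the parsing helpers are illustrative names, not identifiers from this PR.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// pruneRegistry deletes every entry whose last modified time is older than
// now-ignoreOlder and returns the removed keys in sorted order.
func pruneRegistry(registry map[string]time.Time, now time.Time, ignoreOlder time.Duration) []string {
	if ignoreOlder <= 0 {
		return nil // ignore_older not configured: nothing is cleaned up
	}
	cutoff := now.Add(-ignoreOlder)
	var removed []string
	for key, lastModified := range registry {
		if lastModified.Before(cutoff) {
			delete(registry, key) // deleting while ranging over a map is safe in Go
			removed = append(removed, key)
		}
	}
	sort.Strings(removed)
	return removed
}

// mustTime parses an RFC 3339 timestamp and panics on malformed input.
func mustTime(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

// mustDuration parses a Go duration string such as "48h".
func mustDuration(s string) time.Duration {
	d, err := time.ParseDuration(s)
	if err != nil {
		panic(err)
	}
	return d
}

// sampleRegistry returns three tracked objects with made-up timestamps.
func sampleRegistry() map[string]time.Time {
	return map[string]time.Time{
		"object-a": mustTime("2025-01-01T00:00:00Z"),
		"object-b": mustTime("2025-01-06T00:00:00Z"),
		"object-c": mustTime("2025-01-06T12:00:00Z"),
	}
}

func main() {
	registry := sampleRegistry()
	removed := pruneRegistry(registry, mustTime("2025-01-07T00:00:00Z"), mustDuration("48h"))
	fmt.Println("removed:", removed)
	fmt.Println("still tracked:", len(registry))
}
```

Because the cutoff moves forward with the current time, an entry that is kept in one run can be removed in a later run, which is why the registry stays bounded by the look-back window.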

image

Use both ignore_older & start_timestamp

If both properties are defined:

  • The initial run will include objects modified since start_timestamp (ignoring the ignore_older duration).
  • Subsequent runs will only consider objects within the ignore_older duration.
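The combined behavior can be sketched as a single decision function. This is an illustration under stated assumptions rather than the PR's implementation: objectFilter, accepts and the firstRun flag are hypothetical names, and the timestamps are made up.

```go
package main

import (
	"fmt"
	"time"
)

// objectFilter is a hypothetical holder for the two settings.
// Zero values mean the corresponding setting is not configured.
type objectFilter struct {
	startTimestamp time.Time
	ignoreOlder    time.Duration
}

// accepts decides whether an object is processed in a given polling run.
// When both settings are present, the first run applies only start_timestamp
// and later runs apply only ignore_older, as described in the list above.
func (f objectFilter) accepts(lastModified, now time.Time, firstRun bool) bool {
	hasStart := !f.startTimestamp.IsZero()
	hasLookBack := f.ignoreOlder > 0
	afterStart := !lastModified.Before(f.startTimestamp)
	inLookBack := !lastModified.Before(now.Add(-f.ignoreOlder))

	switch {
	case hasStart && hasLookBack:
		if firstRun {
			return afterStart
		}
		return inLookBack
	case hasStart:
		return afterStart
	case hasLookBack:
		return inLookBack
	default:
		return true
	}
}

// mustTime parses an RFC 3339 timestamp and panics on malformed input.
func mustTime(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

// mustDuration parses a Go duration string such as "24h".
func mustDuration(s string) time.Duration {
	d, err := time.ParseDuration(s)
	if err != nil {
		panic(err)
	}
	return d
}

func main() {
	f := objectFilter{
		startTimestamp: mustTime("2025-01-03T00:00:00Z"),
		ignoreOlder:    mustDuration("24h"),
	}
	now := mustTime("2025-01-07T00:00:00Z")
	objects := []struct {
		name         string
		lastModified time.Time
	}{
		{"Object A", mustTime("2025-01-01T00:00:00Z")}, // t1
		{"Object B", mustTime("2025-01-04T00:00:00Z")}, // t2
		{"Object C", mustTime("2025-01-06T12:00:00Z")}, // t3
	}
	for _, o := range objects {
		fmt.Printf("%s: initial run=%t, subsequent run=%t\n",
			o.name, f.accepts(o.lastModified, now, true), f.accepts(o.lastModified, now, false))
	}
}
```

With these sample values, Object B is picked up by the initial run but falls outside the 24h window afterwards, which is the case the Disruptive User Impact section warns about for objects whose processing failed.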

image


This is an automatic backport of pull request #41817 done by [Mergify](https://mergify.com).

Footnotes

  1. https://github.com/elastic/beats/pull/41817/files#diff-422765b7341c5bbf6de7af38927e34e00a5073b188585a7af3c4fee1175b64a6

  2. https://github.com/Kavindu-Dodan/data-gen

…wing better registry cleanups (#41817)

* add changelog entry

Signed-off-by: Kavindu Dodanduwa <[email protected]>

* sort config entries

Signed-off-by: Kavindu Dodanduwa <[email protected]>

* introduce ignore old and start timestamp configurations and document them

Signed-off-by: Kavindu Dodanduwa <[email protected]>

* add filtering logic

Signed-off-by: Kavindu Dodanduwa <[email protected]>

* filter tests

Signed-off-by: Kavindu Dodanduwa <[email protected]>

* add component test for filtering and fix lint issues

Signed-off-by: Kavindu Dodanduwa <[email protected]>

# Conflicts:
#	x-pack/filebeat/input/awss3/s3_test.go

* add changelog entry

Signed-off-by: Kavindu Dodanduwa <[email protected]>

* improve documentation

Signed-off-by: Kavindu Dodanduwa <[email protected]>

* review changes - improve naming, change signature and improve documentation

Signed-off-by: Kavindu Dodanduwa <[email protected]>

---------

Signed-off-by: Kavindu Dodanduwa <[email protected]>
(cherry picked from commit 4ba7d1c)
@mergify mergify bot added the backport label Jan 7, 2025
@mergify mergify bot requested review from a team as code owners January 7, 2025 18:47
@mergify mergify bot requested review from rdner and faec and removed request for a team January 7, 2025 18:47
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Jan 7, 2025
@elastic elastic deleted a comment from botelastic bot Jan 7, 2025
@Kavindu-Dodan Kavindu-Dodan added the Team:obs-ds-hosted-services Label for the Observability Hosted Services team label Jan 7, 2025
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Jan 7, 2025
@elasticmachine

Pinging @elastic/obs-ds-hosted-services (Team:obs-ds-hosted-services)

@Kavindu-Dodan Kavindu-Dodan enabled auto-merge (squash) January 7, 2025 19:38
@Kavindu-Dodan Kavindu-Dodan merged commit 5f67e46 into 8.x Jan 7, 2025
22 checks passed
@Kavindu-Dodan Kavindu-Dodan deleted the mergify/bp/8.x/pr-41817 branch January 7, 2025 21:19